4b) Errors and power — torturing the t-test

Data Simulation with Monte Carlo Methods

Marko Bachl

University of Hohenheim

Errors and power — torturing the t-test, part II

Traktandenliste

  1. Introduction & overview

  2. Some types and examples of simulation studies

  3. Proof by simulation — the Central Limit Theorem (CLT)

  4. Errors and power — torturing the t-test

  5. Misclassification and bias — Messages mismeasured

  6. Outlook: What’s next?

4. Errors and power — torturing the t-test

  1. Comparison of Student’s and Welch’s t-tests

  2. A priori power calculation for Welch’s t-test

Student’s t-test & Welch’s t-test

  • Old school advice: Student’s t-test for equal variances, Welch’s t-test for unequal variances.

    • Higher power of Student’s t-test if assumptions hold.
  • Modern advice: Always use Welch’s t-test.

    • Better if assumptions are violated; not worse if assumptions hold.

    • e.g., Delacre, M., Lakens, D., & Leys, C. (2017). Why Psychologists Should by Default Use Welch’s t-test Instead of Student’s t-test. International Review of Social Psychology, 30(1). https://doi.org/10.5334/irsp.82

  • For those who don’t care about t-tests: Idea also applies to heteroskedasticity-consistent standard errors.

Second Simulation study: Power

  1. Question: What is the goal of the simulation?

    • Comparison of Student’s and Welch’s t-tests
  2. Quantities of interest: What is measured in the simulation?

    • Power (\(1 - \beta\)) of the tests
  3. Evaluation strategy: How are the quantities assessed?

    • Comparison of statistical power to reject the Null hypothesis if the alternative hypothesis is true
  4. Conditions: Which characteristics of the data generating model will be varied?

    • Sample size, effect size (mean difference)
  5. Data generating model: How are the data simulated?

    • Random draws from normal distributions with different group means and different overall sample sizes, but euqal group sizes and standard deviations

Additional practical considerations

  • How to adapt the simulation function to include different group means?

  • How to adapt the experimental conditions

    • to include different group means and different total sample sizes?

    • to only include conditions in which the assumptions of Student’s t-test hold?

Adapt the simulation function

  • Include new argument M_diff

    • Mean of the treatment condition; equals mean difference, as mean of the control is fixed at 0.

    • Mean difference equals standardized effect size d if SDR, and, consequently, both group standard deviations, are 1.

sim_ttest = function(n = 200, GR = 1, SDR = 1, M_diff = 0) {
  n1 = round(n / (GR + 1))
  n2 = round(n1 * GR)
  sd1 = 1
  sd2 = sd1 * SDR # sd2/sd1
  g1 = rnorm(n = n1, mean = 0, sd = sd1) # control
  g2 = rnorm(n = n2, mean = M_diff, sd = sd2) # treatment
  welch = t.test(g1, g2)$p.value
  student = t.test(g1, g2, var.equal = TRUE)$p.value
  res = list("Welch" = welch, "Student" = student)
  res
}
sim_ttest(n = 100, M_diff = 0.3)
$Welch
[1] 0.2752475

$Student
[1] 0.2752067

Adapt the experimental conditions

  • GR and SDR are now fixed at 1.

  • New between-simulation factors: n and M_diff

conditions = expand_grid(
  GR = 1, 
  SDR = 1,
  n = c(100, 200, 300),
  M_diff = c(0.2, 0.4, 0.6)) %>% 
  rowid_to_column(var = "condition")
conditions
# A tibble: 9 × 5
  condition    GR   SDR     n M_diff
      <int> <dbl> <dbl> <dbl>  <dbl>
1         1     1     1   100    0.2
2         2     1     1   100    0.4
3         3     1     1   100    0.6
4         4     1     1   200    0.2
5         5     1     1   200    0.4
6         6     1     1   200    0.6
7         7     1     1   300    0.2
8         8     1     1   300    0.4
9         9     1     1   300    0.6

Run simulation experiment

set.seed(326)
i = 10000
tic()
sims = map_dfr(1:i, ~ conditions) %>% # each condition i times
  rowid_to_column(var = "sim") %>% # within simulation comparison
  rowwise() %>%
  mutate(p.value = list(sim_ttest(n = n, M_diff = M_diff))) %>% 
  unnest_longer(p.value, indices_to = "method")
toc()
22.822 sec elapsed

Check results: Simulation v analytical solution

sims %>% 
  filter(M_diff == 0.4 & method == "Student") %>% 
  group_by(n, M_diff, method) %>% 
  summarise(P_p05 = mean(p.value < 0.05)) %>% 
  ungroup() %>% 
  mutate(across(where(is.numeric), round, 3))
# A tibble: 3 × 4
      n M_diff method  P_p05
  <dbl>  <dbl> <chr>   <dbl>
1   100    0.4 Student 0.505
2   200    0.4 Student 0.802
3   300    0.4 Student 0.928
power.t.test(n = c(100, 200, 300)/2,
             delta = 0.4, sd = 1,
             sig.level = 0.05)

     Two-sample t test power calculation 

              n = 50, 100, 150
          delta = 0.4
             sd = 1
      sig.level = 0.05
          power = 0.5081451, 0.8036466, 0.9322752
    alternative = two.sided

NOTE: n is number in *each* group
  • Close enough for government work

Results: Power comparison

Show the code
sims %>% 
  group_by(n, M_diff, method) %>% 
  summarise(P_p05 = mean(p.value < 0.05)) %>% 
  ggplot(aes(factor(n), P_p05, color = method, group = method)) +
  geom_point(position = position_dodge(width = 0.3), size = 3) + 
  geom_line(position = position_dodge(width = 0.3), size = 1.5) +
  facet_wrap(vars(M_diff), labeller = label_both) +
  labs(x = "n", y = "Power")

Summary of the t-test comparison

  • Student’s t-test fails the nominal false discovery rate if variances and group sizes are unequal (simulation study 1).

  • Welch’s t-test has similar statistical power as Student’s t-test if the assumption of equal group variances holds (simulation study 2).

  • We should always use Welch’s t-test as a default.

  • For those who don’t use t-tests: We probably should also use heteroskedasticity-consistent standard errors as default in OLS linear models.

Lessons learnt for simulation experiments

  • TODO

Questions?

Power estimation for Welch’s t-test

Power estimation for Welch’s t-test

  • So far we have learnt that we should prefer Welch’s t-test over Student’s t-test when comparing two group means.

  • Is seems plausible that different group sizes and group standard deviations may influence statistical power.

  • However, the standard options for a priori power calculation often do not allow for varying group sizes and varying group SDs.

  • This is a good segue to using Monte Carlo simulation experiments for a priori power calculation with a (seemingly) simple example.

Getting started: Some preliminary thoughts

  • We know that effect size and total sample size are positively related to statistical power, so we are not that interested in their effects.

  • We know less about the role of group sample sizes and group standard deviations differences, so this is what we want to investigate in the simulation.

  • We have to think more about how we define the effect size.

    • This is generally one of the harder tasks in a priori power calculation

    • and it becomes more complicated if simple “mean divided by SD” rules of thumb get more ambiguous.

Third Simulation study: Power (again)

  1. Question: What is the goal of the simulation?

    • Understand roles of group sample sizes and group standard deviations for statistical power
  2. Quantities of interest: What is measured in the simulation?

    • Power (\(1 - \beta\)) of the designs using Welch’s t-test
  3. Evaluation strategy: How are the quantities assessed?

    • Comparison of statistical power to reject the Null hypothesis if the alternative hypothesis is true
  4. Conditions: Which characteristics of the data generating model will be varied?

    • Group sample sizes, group standard deviations
  5. Data generating model: How are the data simulated?

    • Random draws from normal distributions with different euqal group sizes and standard deviations, but fixed group mean differences and overall sample size.

First try: Adapt the simulation function

  • We no longer need Student’s t-test.

    • The output is simply the p-value of Welch’s t-test.
  • Beyond that, the simulation function stays the same.

sim_ttest = function(n = 200, GR = 1, SDR = 1, M_diff = 0) {
  n1 = round(n / (GR + 1))
  n2 = round(n1 * GR)
  sd1 = 1
  sd2 = sd1 * SDR # sd2/sd1
  g1 = rnorm(n = n1, mean = 0, sd = sd1)
  g2 = rnorm(n = n2, mean = M_diff, sd = sd2)
  res = t.test(g1, g2)$p.value
  return(res)
}
sim_ttest(n = 300, M_diff = 0.3, GR = 0.5)
[1] 0.1974923

First try: Adapt the experimental conditions

  • Total sample size n = 300 and effect size M_diff = 0.3 are now fixed.

  • New between-simulation factors: GR and SDR

conditions = expand_grid(GR = c(0.5, 1, 2), 
                         SDR = c(0.5, 1, 2),
                         n = 300,
                         M_diff = 0.3) %>% 
  rowid_to_column(var = "condition") %>% 
  mutate(
    n1 = round(n / (GR + 1)),
    n2 = round(n1 * GR),
    sd1 = 1,
    sd2 = sd1 * SDR
    ) 
conditions %>% select(1:5)
# A tibble: 9 × 5
  condition    GR   SDR     n M_diff
      <int> <dbl> <dbl> <dbl>  <dbl>
1         1   0.5   0.5   300    0.3
2         2   0.5   1     300    0.3
3         3   0.5   2     300    0.3
4         4   1     0.5   300    0.3
5         5   1     1     300    0.3
6         6   1     2     300    0.3
7         7   2     0.5   300    0.3
8         8   2     1     300    0.3
9         9   2     2     300    0.3

This has some implications

  • For effect size interpretation:

    • If the SD ratio is not equal to 1, the pooled SD is not equal to 1 and the mean difference is not in units of the pooled SD (as in traditional d).

    • Instead, it is the difference in SD of the control group units.

    • Sensible definition for intervention trials with passive control group: Estimate of the population variation without a intervention.

    • Maybe less sensible in other designs with more active controls (e.g., typical media effects experimental studies).

This has some implications

  • For substantive interpretation:

    • SDR = 0.5 assumes a treatment which not only increases the level of the outcome by M_diff, but also halves the variation in the outcome.

    • SDR = 2 assumes a treatment which not only increases the level of the outcome by M_diff, but also doubles the variation in the outcome.

    • Heterogeneous treatment effects!

First try: Run simulation experiment

set.seed(7913)
i = 10000
tic()
sims = map_dfr(1:i, ~ conditions) %>% # each condition 10,000 times
  rowid_to_column(var = "sim") %>%
  rowwise() %>%
  mutate(p.value = sim_ttest(n = n, M_diff = M_diff, GR = GR, SDR = SDR))
toc()
11.441 sec elapsed

First try: Results

Show the code
sims %>% 
  group_by(GR, SDR) %>% 
  summarise(P_p05 = mean(p.value < 0.05)) %>% 
  ggplot(aes(factor(GR), P_p05, group = 1)) + 
  geom_point() + geom_line() +
  facet_wrap(vars(SDR), labeller = label_both) +
  labs(x = "Group size ratio", y = "Power")

First try: Results

Show the code
sims %>% 
  group_by(n, M_diff, GR, SDR, n1, n2, sd1, sd2) %>% 
  summarise(P_p05 = mean(p.value < 0.05), .groups = "drop") %>% 
  mutate(implies = str_glue("n_c = {n1}, n_t = {n2}, sd_c = {round(sd1, 2)}, sd_t = {round(sd2, 2)}")) %>% 
  mutate(implies = fct_reorder(implies, P_p05)) %>% 
  ggplot(aes(P_p05, implies)) + geom_point(size = 3) +
  labs(x = "Power", y = "Condition") +
  theme(axis.text.y = element_text(size = 20))

First try: Summary

  • TODO

Second try: Bring back a pooled SD of 1

  • We might not be satisfied with the trivial result that more variation decreases power.

  • We might be interested in a more conventional d-style scaling of the effect size.

  • We want a pooled SD of 1 while retaining the SD ratio design.

Wikipedia to the rescue

Standard deviations of non-overlapping (X ∩ Y = ∅) sub-populations can be aggregated as follows if the size (actual or relative to one another) and means of each are known:

\[ \sigma_{X\cup Y} = \sqrt{ \frac{N_X \sigma_X^2 + N_Y \sigma_Y^2}{N_X + N_Y} + \frac{N_X N_Y}{(N_X+N_Y)^2}(\mu_X - \mu_Y)^2 } \] and

\[ \sigma_Y = SDR * \sigma_X \]

Second try: Adapt the simulation function

  • We add a new argument, pooled_sd.

  • sd1 is computed based on pooled_sd, GR (via n1 and n2), SDR, and M_diff.

sim_ttest = function(n = 200, pooled_sd = 1, GR = 1, SDR = 1, M_diff = 0) {
  n1 = round(n / (GR + 1))
  n2 = round(n1 * GR)
  sd1 = sqrt(
    ((pooled_sd^2 - ((n1 * n2) / (n1 + n2)^2) * M_diff^2) * (n1 + n2)) / 
      (n1 + n2 * SDR^2)
    )
  sd2 = sd1 * SDR # sd2/sd1
  g1 = rnorm(n = n1, mean = 0, sd = sd1)
  g2 = rnorm(n = n2, mean = M_diff, sd = sd2)
  res = t.test(g1, g2)$p.value
  return(res)
}
sim_ttest(n = 300, M_diff = 0.3, GR = 0.5)
[1] 0.01249033

Second try: Adapt the experimental conditions

  • Total sample size n = 300 and effect size M_diff = 0.3 are now fixed.

  • In addition, pooled_sd is fixed at 1, returning to a d-style interpretation of the effect size.

  • Between-simulation factors: GR and SDR

conditions = expand_grid(GR = c(0.5, 1, 2), 
                         SDR = c(0.5, 1, 2),
                         pooled_sd = 1,
                         n = 300,
                         M_diff = 0.3) %>% 
  rowid_to_column(var = "condition") %>% 
  mutate(
    n1 = round(n / (GR + 1)), n2 = round(n1 * GR),
    sd1 = sqrt(((pooled_sd^2 - ((n1 * n2) / (n1 + n2)^2) * M_diff^2) * (n1 + n2)) / (n1 + n2 * SDR^2)),
    sd2 = sd1 * SDR
    ) 
conditions %>% select(1:6)
# A tibble: 9 × 6
  condition    GR   SDR pooled_sd     n M_diff
      <int> <dbl> <dbl>     <dbl> <dbl>  <dbl>
1         1   0.5   0.5         1   300    0.3
2         2   0.5   1           1   300    0.3
3         3   0.5   2           1   300    0.3
4         4   1     0.5         1   300    0.3
5         5   1     1           1   300    0.3
6         6   1     2           1   300    0.3
7         7   2     0.5         1   300    0.3
8         8   2     1           1   300    0.3
9         9   2     2           1   300    0.3

Second try: Run simulation experiment

set.seed(39)
i = 10000
tic()
sims = map_dfr(1:i, ~ conditions) %>% # each condition 10,000 times
  rowid_to_column(var = "sim") %>%
  rowwise() %>%
  mutate(p.value = sim_ttest(n = n, M_diff = M_diff, GR = GR, SDR = SDR))
toc()
11.141 sec elapsed

Second try: Results

Show the code
sims %>% 
  group_by(GR, SDR) %>% 
  summarise(P_p05 = mean(p.value < 0.05)) %>% 
  ggplot(aes(factor(GR), P_p05, group = 1)) + 
  geom_point() + geom_line() +
  facet_wrap(vars(SDR), labeller = label_both) +
  labs(x = "Group size ratio", y = "Power")

Second try: Results

Show the code
sims %>% 
  group_by(n, M_diff, GR, SDR, n1, n2, sd1, sd2) %>% 
  summarise(P_p05 = mean(p.value < 0.05), .groups = "drop") %>% 
  mutate(implies = str_glue("n_c = {n1}, n_t = {n2}, sd_c = {round(sd1, 2)}, sd_t = {round(sd2, 2)}")) %>% 
  mutate(implies = fct_reorder(implies, P_p05)) %>% 
  ggplot(aes(P_p05, implies)) + geom_point(size = 3) +
  labs(x = "Power", y = "Condition") +
  theme(axis.text.y = element_text(size = 20))

Second try: Summary

XXX

Simulation for a priori power calculation: Summary

XXX

Questions?

Group exercise 2b

Group exercise 2b.

  • Introduce Poisson